The purpose of this assignment is to explore, analyze and model a data set containing approximately 2200 records, each record representing a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
We will wrangle and clean the dataset and then create three possible models: a base model to compare the others with, a model with transformed variables, and a model with certain records eliminated.
One grouping that can be shown to have an impact is a cohort of 102 records with missing values for both batting and pitching strikeouts (and for caught stealing). This cohort shows clear interactions with the other variables, though it is not clear why.
There is a high degree of multicollinearity.
At least one column is implausible: home runs allowed (PITCHING_HR) is not only highly correlated with home runs hit (BATTING_HR) (.92), but over 50% of the values are an exact match.
We begin with an initial exploration of the dataset, which has 2,276 rows and 17 columns.
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | INDEX | 0 | 1.0000000 | 1268.46353 | 736.34904 | 1 | 630.75 | 1270.5 | 1915.50 | 2535 | ▇▇▇▇▇ |
| numeric | TARGET_WINS | 0 | 1.0000000 | 80.79086 | 15.75215 | 0 | 71.00 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
| numeric | BATTING_H | 0 | 1.0000000 | 1469.26977 | 144.59120 | 891 | 1383.00 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
| numeric | BATTING_2B | 0 | 1.0000000 | 241.24692 | 46.80141 | 69 | 208.00 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
| numeric | BATTING_3B | 0 | 1.0000000 | 55.25000 | 27.93856 | 0 | 34.00 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
| numeric | BATTING_HR | 0 | 1.0000000 | 99.61204 | 60.54687 | 0 | 42.00 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
| numeric | BATTING_BB | 0 | 1.0000000 | 501.55888 | 122.67086 | 0 | 451.00 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
| numeric | BATTING_SO | 102 | 0.9551845 | 735.60534 | 248.52642 | 0 | 548.00 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
| numeric | BASERUN_SB | 131 | 0.9424429 | 124.76177 | 87.79117 | 0 | 66.00 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
| numeric | BASERUN_CS | 772 | 0.6608084 | 52.80386 | 22.95634 | 0 | 38.00 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
| numeric | BATTING_HBP | 2085 | 0.0839192 | 59.35602 | 12.96712 | 29 | 50.50 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
| numeric | PITCHING_H | 0 | 1.0000000 | 1779.21046 | 1406.84293 | 1137 | 1419.00 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
| numeric | PITCHING_HR | 0 | 1.0000000 | 105.69859 | 61.29875 | 0 | 50.00 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
| numeric | PITCHING_BB | 0 | 1.0000000 | 553.00791 | 166.35736 | 0 | 476.00 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
| numeric | PITCHING_SO | 102 | 0.9551845 | 817.73045 | 553.08503 | 0 | 615.00 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
| numeric | FIELDING_E | 0 | 1.0000000 | 246.48067 | 227.77097 | 65 | 127.00 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
| numeric | FIELDING_DP | 286 | 0.8743409 | 146.38794 | 26.22639 | 52 | 131.00 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
We can see that many columns have missing values. BATTING_HBP is missing in 2,085 of the 2,276 rows, so we will remove it.
While it is tempting to simply impute the values for NAs for the other columns, we need to ensure there isn’t any kind of grouping effect for the records with NA. This is because missing values may reflect an earlier era of baseball when these statistics weren’t collected.
First we look at how well the missing values line up across columns. We remove the rows where PITCHING_SO is missing and count the missing values that remain:
## BATTING_SO PITCHING_SO BASERUN_CS BASERUN_SB FIELDING_DP
## 0 0 670 131 286
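There is some cohort effect: every row that is missing PITCHING_SO is also missing BATTING_SO, and the BASERUN_CS count drops by the same 102 rows (from 772 to 670), so the cohort is missing caught stealing as well. The BASERUN_SB and FIELDING_DP counts are unchanged.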
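The overlap check can be sketched in a few lines of base R. This is a minimal illustration on a made-up `toy` data frame, not the code that produced the output above; the column names are borrowed from the dataset, but the values are invented.

```r
# Toy data standing in for the training set (values are made up).
toy <- data.frame(
  BATTING_SO  = c(NA, NA, 700, 650, 800, 720),
  PITCHING_SO = c(NA, NA, 710, 640, 790, 730),
  BASERUN_CS  = c(NA, NA, NA, 50, 45, NA),
  BASERUN_SB  = c(90, 110, NA, 120, 80, 100)
)

# Missing values per column before any filtering.
na_before <- colSums(is.na(toy))

# Drop the rows where PITCHING_SO is missing, then recount.
kept <- toy[!is.na(toy$PITCHING_SO), ]
na_after <- colSums(is.na(kept))

# A column whose count falls by the number of dropped rows
# is missing in every one of the dropped rows.
overlap <- na_before - na_after
print(rbind(na_before, na_after, overlap))
```

A column whose `overlap` equals the number of dropped rows belongs to the same missing-data cohort; a column whose `overlap` is zero is missing independently of it.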
We can now try four imputation strategies (cohort rows removed or kept, mean or median imputation), regress TARGET_WINS on all the variables, and compare the adjusted R-squared values.
Here are the results for the models in this order:
## [1] 0.313437
## [1] 0.3147084
## [1] 0.3169625
## [1] 0.3178529
There appears to be a minor effect. Imputing the mean to the other columns with NA and removing cohort records has a very small positive effect on the model.
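A comparison of this kind might look like the sketch below. It runs on simulated data, and the helper names `impute` and `adj_r2`, the columns `x1`, `x2`, `y`, and the ordering of the four results are all assumptions made for illustration; they are not taken from the original analysis.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n)
toy$x2[1:20] <- NA               # the "cohort": rows missing x2
toy$x1[sample(21:n, 15)] <- NA   # scattered missing values elsewhere

# Replace NAs in every predictor with a summary statistic (mean or median).
impute <- function(df, stat) {
  for (col in setdiff(names(df), "y")) {
    miss <- is.na(df[[col]])
    df[[col]][miss] <- stat(df[[col]], na.rm = TRUE)
  }
  df
}

# Adjusted R-squared of the regression on all predictors.
adj_r2 <- function(df) summary(lm(y ~ ., data = df))$adj.r.squared

cohort <- is.na(toy$x2)
results <- c(
  kept_mean      = adj_r2(impute(toy, mean)),
  kept_median    = adj_r2(impute(toy, median)),
  removed_mean   = adj_r2(impute(toy[!cohort, ], mean)),
  removed_median = adj_r2(impute(toy[!cohort, ], median))
)
print(round(results, 4))
```

Naming the elements of `results` keeps the strategy attached to each number, which avoids having to remember the order in which the models were run.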
Now we can look at interactions between the “cohort” and other variables:
The interaction analysis suggests that the cohort is not random: it interacts with many other variables, and some of those interactions are quite counterintuitive (PITCHING_H, for example). We could either model the cohort (random effects, a flag, or interaction terms) or drop it. Because bad data is not reproducible, I will drop these records, at the expense of better predictions if the cohort can be identified in the evaluation data.
Now we can look at the summary statistics for our new dataset.
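One way to test whether a cohort behaves differently is to add a 0/1 flag and interact it with a predictor. The sketch below uses simulated data in which the slope differs inside the cohort by construction; the variable names `pitching_h`, `cohort`, and `wins` are hypothetical stand-ins, not the original code.

```r
set.seed(2)
n <- 300
toy <- data.frame(pitching_h = rnorm(n, mean = 1500, sd = 100))
toy$cohort <- rep(c(1, 0), times = c(60, n - 60))  # 1 = row belongs to the cohort

# By construction, the slope on hits allowed is different inside the cohort.
centered <- toy$pitching_h - 1500
toy$wins <- 80 - 0.01 * centered + 0.05 * toy$cohort * centered + rnorm(n, sd = 5)

# pitching_h * cohort expands to both main effects plus their interaction.
fit <- lm(wins ~ pitching_h * cohort, data = toy)
coefs <- summary(fit)$coefficients
print(coefs)
```

A small p-value on the `pitching_h:cohort` row indicates that the relationship between the predictor and wins is different for cohort rows, which is the pattern described above.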
## INDEX TARGET_WINS BATTING_H BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 640.2 1st Qu.: 71.00 1st Qu.:1389 1st Qu.:211.2
## Median :1275.5 Median : 82.00 Median :1458 Median :240.0
## Mean :1275.2 Mean : 80.76 Mean :1475 Mean :243.9
## 3rd Qu.:1923.8 3rd Qu.: 91.00 3rd Qu.:1541 3rd Qu.:275.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
## BATTING_3B BATTING_HR BATTING_BB BATTING_SO
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 48.0 1st Qu.:456.0 1st Qu.: 548.0
## Median : 46.00 Median :107.0 Median :517.0 Median : 750.0
## Mean : 54.45 Mean :103.4 Mean :505.1 Mean : 735.6
## 3rd Qu.: 71.00 3rd Qu.:148.0 3rd Qu.:582.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.0 Max. :878.0 Max. :1399.0
## BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR
## Min. : 0.0 Min. : 0.0 Min. : 1137 Min. : 0.0
## 1st Qu.: 66.0 1st Qu.: 44.0 1st Qu.: 1425 1st Qu.: 58.0
## Median :102.0 Median : 52.8 Median : 1521 Median :111.0
## Mean :121.1 Mean : 52.8 Mean : 1794 Mean :109.7
## 3rd Qu.:143.8 3rd Qu.: 55.0 3rd Qu.: 1694 3rd Qu.:152.8
## Max. :697.0 Max. :201.0 Max. :30132 Max. :343.0
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 479.2 1st Qu.: 615.0 1st Qu.: 126.0 1st Qu.:137.0
## Median : 542.0 Median : 813.5 Median : 155.0 Median :146.4
## Mean : 557.5 Mean : 817.7 Mean : 243.9 Mean :148.6
## 3rd Qu.: 614.8 3rd Qu.: 968.0 3rd Qu.: 234.0 3rd Qu.:162.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
There are four columns where zeros may really be NAs: pitching and batting home runs, and pitching and batting strikeouts. We look more closely at these columns:
We can check whether the zeros behave like NAs or like actual values by comparing their relationship with PITCHING_H in both cases. The two groups behave very differently from each other, and neither behaves like the overall sample:
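A simplified version of this check labels each row as zero, missing, or observed and then summarises another variable within each group. The sketch below compares group means of PITCHING_H on invented values; the original analysis compared the relationship in more detail, so this is only an illustration of the grouping step.

```r
# Made-up values: two zero rows, two NA rows, four ordinary rows.
toy <- data.frame(
  BATTING_SO = c(0, 0, NA, NA, 700, 820, 640, 910),
  PITCHING_H = c(5000, 6200, 1500, 1450, 1400, 1380, 1520, 1470)
)

# Label each row by the status of its strikeout value.
group <- ifelse(is.na(toy$BATTING_SO), "missing",
         ifelse(toy$BATTING_SO == 0, "zero", "observed"))

# Compare hits allowed across the three groups.
by_group <- tapply(toy$PITCHING_H, group, mean)
print(by_group)
```

If the zero rows looked like the missing rows, treating zeros as NAs would be easier to justify; when the groups differ from each other, as in the data here, neither treatment is clearly right.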
Looking for other groups, it is hard to say. Lower strikeout totals seem to be more negatively correlated with wins than the rest of the data, but the group sizes may be small:
For now we will do nothing about the outliers or the zeros that may be NAs.
## [1] 0 12 14 17 21 22
## INDEX TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## 1 1347 0 891 135 0 0 0
## BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## 1 0 0 0 NA 24057 0
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## 1 0 0 1890 NA
TARGET_WINS appears approximately normally distributed. The zero is suspicious, but I’m going to leave it.
## INDEX TARGET_WINS BATTING_H BATTING_2B BATTING_3B
## INDEX 1.0000000000 -0.02928140 -0.03131390 -0.003976934 -0.00497585
## TARGET_WINS -0.0292813985 1.00000000 0.39476995 0.293205037 0.13685882
## BATTING_H -0.0313139014 0.39476995 1.00000000 0.540648272 0.45802046
## BATTING_2B -0.0039769341 0.29320504 0.54064827 1.000000000 -0.08532550
## BATTING_3B -0.0049758496 0.13685882 0.45802046 -0.085325497 1.00000000
## BATTING_HR 0.0413809930 0.19059035 -0.06194956 0.393641975 -0.63765753
## BATTING_BB -0.0358540809 0.23250609 -0.10545406 0.230196649 -0.28160593
## BATTING_SO 0.0814501106 -0.03175071 -0.46385357 0.162685188 -0.66978119
## BASERUN_SB 0.0435154747 0.11143414 0.14886129 -0.153728585 0.49301668
## BASERUN_CS 0.0004632733 0.01610843 0.01198251 -0.077632602 0.19833581
## PITCHING_H 0.0146890757 -0.11576530 0.29979491 0.008872511 0.20396690
## PITCHING_HR 0.0403725584 0.20531868 0.02082589 0.412455481 -0.56629509
## PITCHING_BB -0.0233549401 0.12063924 0.07067846 0.149565361 0.01294580
## PITCHING_SO 0.0558901457 -0.07843609 -0.25265679 0.064792315 -0.25881893
## FIELDING_E -0.0068738726 -0.17639551 0.28252119 -0.232247607 0.51354615
## FIELDING_DP 0.0061318975 -0.02860414 0.04535652 0.178563220 -0.21908499
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS
## INDEX 0.04138099 -0.03585408 0.08145011 0.04351547 0.0004632733
## TARGET_WINS 0.19059035 0.23250609 -0.03175071 0.11143414 0.0161084320
## BATTING_H -0.06194956 -0.10545406 -0.46385357 0.14886129 0.0119825143
## BATTING_2B 0.39364197 0.23019665 0.16268519 -0.15372858 -0.0776326024
## BATTING_3B -0.63765753 -0.28160593 -0.66978119 0.49301668 0.1983358054
## BATTING_HR 1.00000000 0.50439692 0.72706935 -0.39942181 -0.3034743273
## BATTING_BB 0.50439692 1.00000000 0.37975087 -0.06545891 -0.0861202523
## BATTING_SO 0.72706935 0.37975087 1.00000000 -0.23837153 -0.1566149092
## BASERUN_SB -0.39942181 -0.06545891 -0.23837153 1.00000000 0.2869124889
## BASERUN_CS -0.30347433 -0.08612025 -0.15661491 0.28691249 1.0000000000
## PITCHING_H -0.27656010 -0.46585690 -0.37568637 0.07198568 -0.0369545996
## PITCHING_HR 0.96659392 0.44681242 0.66717889 -0.36564098 -0.3034478040
## PITCHING_BB 0.10677385 0.47385394 0.03700514 0.14323815 -0.0542531880
## PITCHING_SO 0.18470756 -0.02075682 0.41623330 -0.05615058 -0.0686217842
## FIELDING_E -0.59891151 -0.66138116 -0.58466444 0.36999309 0.0236201201
## FIELDING_DP 0.33368751 0.32158157 0.14599850 -0.24957358 -0.1563091914
## PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## INDEX 0.014689076 0.04037256 -0.02335494 0.05589015 -0.006873873
## TARGET_WINS -0.115765302 0.20531868 0.12063924 -0.07843609 -0.176395507
## BATTING_H 0.299794910 0.02082589 0.07067846 -0.25265679 0.282521195
## BATTING_2B 0.008872511 0.41245548 0.14956536 0.06479231 -0.232247607
## BATTING_3B 0.203966905 -0.56629509 0.01294580 -0.25881893 0.513546149
## BATTING_HR -0.276560100 0.96659392 0.10677385 0.18470756 -0.598911507
## BATTING_BB -0.465856896 0.44681242 0.47385394 -0.02075682 -0.661381160
## BATTING_SO -0.375686369 0.66717889 0.03700514 0.41623330 -0.584664436
## BASERUN_SB 0.071985680 -0.36564098 0.14323815 -0.05615058 0.369993094
## BASERUN_CS -0.036954600 -0.30344780 -0.05425319 -0.06862178 0.023620120
## PITCHING_H 1.000000000 -0.16448724 0.31845282 0.26724807 0.672838853
## PITCHING_HR -0.164487236 1.00000000 0.19575531 0.20588053 -0.501758136
## PITCHING_BB 0.318452818 0.19575531 1.00000000 0.48849865 -0.016375919
## PITCHING_SO 0.267248074 0.20588053 0.48849865 1.00000000 -0.023291783
## FIELDING_E 0.672838853 -0.50175814 -0.01637592 -0.02329178 1.000000000
## FIELDING_DP -0.088957308 0.32336753 0.15211734 0.01039232 -0.257897297
## FIELDING_DP
## INDEX 0.006131897
## TARGET_WINS -0.028604138
## BATTING_H 0.045356517
## BATTING_2B 0.178563220
## BATTING_3B -0.219084985
## BATTING_HR 0.333687510
## BATTING_BB 0.321581568
## BATTING_SO 0.145998500
## BASERUN_SB -0.249573580
## BASERUN_CS -0.156309191
## PITCHING_H -0.088957308
## PITCHING_HR 0.323367525
## PITCHING_BB 0.152117341
## PITCHING_SO 0.010392318
## FIELDING_E -0.257897297
## FIELDING_DP 1.000000000
Investigate the suspicious HR category:
##
## Pearson's product-moment correlation
##
## data: dfTrain$PITCHING_HR and dfTrain$TARGET_WINS
## t = 9.1789, df = 2274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1490846 0.2283275
## sample estimates:
## cor
## 0.1890137
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_HR, data = dfTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.657 -9.956 0.636 10.055 67.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.656920 0.646540 117.018 <2e-16 ***
## PITCHING_HR 0.048572 0.005292 9.179 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.47 on 2274 degrees of freedom
## Multiple R-squared: 0.03573, Adjusted R-squared: 0.0353
## F-statistic: 84.25 on 1 and 2274 DF, p-value: < 2.2e-16
## StudRes Hat CookD
## 299 4.380293 0.0006944747 0.0066141630
## 832 0.173993 0.0070267976 0.0001071615
## 964 -1.050146 0.0058117315 0.0032231964
## 1211 -4.919225 0.0017463018 0.0209523937
## 2233 -4.132563 0.0017463018 0.0148329515
##
## Pearson's product-moment correlation
##
## data: dfTrain2$PITCHING_HR and dfTrain2$TARGET_WINS
## t = 8.8525, df = 2269, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1426547 0.2221771
## sample estimates:
## cor
## 0.1827147
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_HR, data = dfTrain2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.949 -9.929 0.614 10.028 55.992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.948820 0.639361 118.789 <2e-16 ***
## PITCHING_HR 0.046356 0.005237 8.852 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.23 on 2269 degrees of freedom
## Multiple R-squared: 0.03338, Adjusted R-squared: 0.03296
## F-statistic: 78.37 on 1 and 2269 DF, p-value: < 2.2e-16
## StudRes Hat CookD
## 964 -1.039523 0.0058683344 0.003189286
## 982 -3.886600 0.0017628831 0.013255859
## 1810 2.114722 0.0049482791 0.011102505
## 1882 -1.303158 0.0058683344 0.005010737
## 2012 3.688318 0.0006272236 0.004245374
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -75.6569 -9.9562 0.6359 0.0000 10.0552 67.4774
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2276 0 15.47 0.64 0.2 14.84 -75.66 67.48 143.13 -0.18 0.86
## se
## X1 0.32
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_HR, data = dfTrain_WithoutHR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.208 -9.802 0.653 9.952 66.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.624136 0.636539 120.376 < 2e-16 ***
## PITCHING_HR 0.041723 0.005197 8.028 1.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.09 on 2263 degrees of freedom
## Multiple R-squared: 0.02769, Adjusted R-squared: 0.02726
## F-statistic: 64.45 on 1 and 2263 DF, p-value: 1.576e-15
## StudRes Hat CookD
## 299 4.4557422 0.0007060697 0.0069560265
## 829 0.2703629 0.0070966028 0.0002613277
## 856 -3.7394216 0.0014507753 0.0101000850
## 961 -0.9956483 0.0058665293 0.0029249611
## 1804 2.1826581 0.0049451007 0.0118181032
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_HR + HR_Low, data = dfTrain_BiModal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.692 -9.976 0.653 10.058 67.556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.529253 1.069339 70.632 < 2e-16 ***
## PITCHING_HR 0.049398 0.007641 6.465 1.24e-10 ***
## HR_Low 0.162504 1.084033 0.150 0.881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.47 on 2273 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.03489
## F-statistic: 42.12 on 2 and 2273 DF, p-value: < 2.2e-16
##
## Welch Two Sample t-test
##
## data: dfLowHR$TARGET_WINS and dfHighHR$TARGET_WINS
## t = -5.4141, df = 753, p-value = 8.291e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.665804 -3.118167
## sample estimates:
## mean of x mean of y
## 77.11327 82.00526
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_HR, data = dfHighHR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.641 -9.293 0.650 9.127 67.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.107959 0.957161 79.514 < 2e-16 ***
## PITCHING_HR 0.044983 0.006848 6.569 6.72e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.73 on 1709 degrees of freedom
## Multiple R-squared: 0.02463, Adjusted R-squared: 0.02405
## F-statistic: 43.15 on 1 and 1709 DF, p-value: 6.72e-11
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2276 6.09 15.1 2 2.93 2.97 -2 249 251 6.98 71.83 0.32
Sum of HR allowed greatly exceeds sum of HR hit
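The exact-match share and the gap between home runs allowed and home runs hit can be computed directly. The sketch below uses a small made-up data frame with the dataset's column names; the numbers are illustrative only.

```r
# Made-up values: four exact matches out of six rows.
toy <- data.frame(
  BATTING_HR  = c(10, 25, 40, 100, 150, 30),
  PITCHING_HR = c(10, 25, 41, 100, 180, 30)
)

# Share of rows where the two columns are identical.
share_equal <- mean(toy$BATTING_HR == toy$PITCHING_HR)

# Row-wise gap: home runs allowed minus home runs hit.
diff_hr <- toy$PITCHING_HR - toy$BATTING_HR

# Correlation between the two columns.
r <- cor(toy$BATTING_HR, toy$PITCHING_HR)

print(c(share_equal = share_equal, total_diff = sum(diff_hr), cor = r))
```

Across a closed league every home run hit is also a home run allowed, so a large positive `total_diff` together with a high `share_equal` is a sign that one of the two columns was derived from the other rather than recorded independently.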
##
## Call:
## lm(formula = dfTrain$BATTING_HR ~ dfTrain$PITCHING_HR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -234.609 0.123 1.336 6.992 12.817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.592392 0.621547 -2.562 0.0105 *
## dfTrain$PITCHING_HR 0.957481 0.005087 188.217 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.87 on 2274 degrees of freedom
## Multiple R-squared: 0.9397, Adjusted R-squared: 0.9397
## F-statistic: 3.543e+04 on 1 and 2274 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: dfTrain$BATTING_BB and dfTrain$PITCHING_BB
## t = 26.759, df = 2274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4574724 0.5199930
## sample estimates:
## cor
## 0.4893613
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.647e-14 -1.120e-15 -7.000e-16 -2.800e-16 1.614e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.756e-13 3.928e-15 4.470e+01 <2e-16 ***
## dfTrain_ImputedMedian[, i] 1.000e+00 4.775e-17 2.094e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.467e-14 on 2172 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.385e+32 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.761 -8.515 0.971 9.783 43.230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.686332 3.164963 5.588 2.58e-08 ***
## dfTrain_ImputedMedian[, i] 0.042775 0.002136 20.025 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.31 on 2172 degrees of freedom
## Multiple R-squared: 0.1558, Adjusted R-squared: 0.1555
## F-statistic: 401 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.863 -9.376 0.670 10.121 57.415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.346919 1.737969 32.42 <2e-16 ***
## dfTrain_ImputedMedian[, i] 0.100118 0.007005 14.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 2172 degrees of freedom
## Multiple R-squared: 0.08597, Adjusted R-squared: 0.08555
## F-statistic: 204.3 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.628 -8.980 1.143 10.428 60.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.62804 0.72265 106.038 < 2e-16 ***
## dfTrain_ImputedMedian[, i] 0.07596 0.01180 6.439 1.48e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.43 on 2172 degrees of freedom
## Multiple R-squared: 0.01873, Adjusted R-squared: 0.01828
## F-statistic: 41.46 on 1 and 2172 DF, p-value: 1.477e-10
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.596 -9.734 0.553 10.041 68.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.595947 0.658670 114.771 <2e-16 ***
## dfTrain_ImputedMedian[, i] 0.050009 0.005527 9.048 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.29 on 2172 degrees of freedom
## Multiple R-squared: 0.03632, Adjusted R-squared: 0.03588
## F-statistic: 81.87 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.936 -9.554 0.579 9.674 78.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.935670 1.370076 48.13 <2e-16 ***
## dfTrain_ImputedMedian[, i] 0.029358 0.002635 11.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.15 on 2172 degrees of freedom
## Multiple R-squared: 0.05406, Adjusted R-squared: 0.05362
## F-statistic: 124.1 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.228 -9.308 0.963 10.609 63.772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.228036 1.043434 78.81 <2e-16 ***
## dfTrain_ImputedMedian[, i] -0.001990 0.001344 -1.48 0.139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 2172 degrees of freedom
## Multiple R-squared: 0.001008, Adjusted R-squared: 0.0005482
## F-statistic: 2.192 on 1 and 2172 DF, p-value: 0.1389
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.284 -9.080 1.024 10.198 65.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.28444 0.57917 135.166 < 2e-16 ***
## dfTrain_ImputedMedian[, i] 0.02048 0.00392 5.226 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.48 on 2172 degrees of freedom
## Multiple R-squared: 0.01242, Adjusted R-squared: 0.01196
## F-statistic: 27.31 on 1 and 2172 DF, p-value: 1.899e-07
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.071 -9.493 1.233 10.483 65.236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.07067 0.98260 81.489 <2e-16 ***
## dfTrain_ImputedMedian[, i] 0.01314 0.01750 0.751 0.453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 2172 degrees of freedom
## Multiple R-squared: 0.0002595, Adjusted R-squared: -0.0002008
## F-statistic: 0.5637 on 1 and 2172 DF, p-value: 0.4528
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.165 -9.462 0.897 10.651 68.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.0150688 0.5308401 156.384 < 2e-16 ***
## dfTrain_ImputedMedian[, i] -0.0012543 0.0002309 -5.432 6.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.47 on 2172 degrees of freedom
## Multiple R-squared: 0.0134, Adjusted R-squared: 0.01295
## F-statistic: 29.5 on 1 and 2172 DF, p-value: 6.205e-08
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.906 -9.846 0.705 9.965 67.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.905514 0.682649 109.728 <2e-16 ***
## dfTrain_ImputedMedian[, i] 0.053432 0.005465 9.777 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.25 on 2172 degrees of freedom
## Multiple R-squared: 0.04216, Adjusted R-squared: 0.04171
## F-statistic: 95.59 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.528 -9.251 0.948 10.415 70.006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.528116 1.149967 64.809 < 2e-16 ***
## dfTrain_ImputedMedian[, i] 0.011187 0.001975 5.664 1.68e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 2172 degrees of freedom
## Multiple R-squared: 0.01455, Adjusted R-squared: 0.0141
## F-statistic: 32.08 on 1 and 2172 DF, p-value: 1.678e-08
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.570 -9.402 0.970 10.484 63.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.5704787 0.5945630 138.876 < 2e-16 ***
## dfTrain_ImputedMedian[, i] -0.0022085 0.0006023 -3.667 0.000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.53 on 2172 degrees of freedom
## Multiple R-squared: 0.006152, Adjusted R-squared: 0.005695
## F-statistic: 13.45 on 1 and 2172 DF, p-value: 0.0002515
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.638 -9.847 0.708 10.050 73.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.645750 0.476605 175.503 <2e-16 ***
## dfTrain_ImputedMedian[, i] -0.011815 0.001415 -8.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.33 on 2172 degrees of freedom
## Multiple R-squared: 0.03112, Adjusted R-squared: 0.03067
## F-statistic: 69.75 on 1 and 2172 DF, p-value: < 2.2e-16
## NULL
##
## Call:
## lm(formula = dfTrain_ImputedMedian$TARGET_WINS ~ dfTrain_ImputedMedian[,
## i])
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.809 -9.322 1.075 10.459 65.191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.70498 2.23001 37.536 <2e-16 ***
## dfTrain_ImputedMedian[, i] -0.01979 0.01484 -1.334 0.182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 2172 degrees of freedom
## Multiple R-squared: 0.0008182, Adjusted R-squared: 0.0003582
## F-statistic: 1.779 on 1 and 2172 DF, p-value: 0.1825
## NULL
Trying a transformation on fielding errors (FIELDING_E): adding a squared term improves the fit to some degree, raising adjusted R-squared from 0.031 to 0.052.
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_E, data = dfTrain_ImputedMedian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.638 -9.847 0.708 10.050 73.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.645750 0.476605 175.503 <2e-16 ***
## FIELDING_E -0.011815 0.001415 -8.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.33 on 2172 degrees of freedom
## Multiple R-squared: 0.03112, Adjusted R-squared: 0.03067
## F-statistic: 69.75 on 1 and 2172 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_E + sq, data = dfTrain_ImputedMedian2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.981 -9.787 0.647 10.285 72.647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.985e+01 7.178e-01 111.246 < 2e-16 ***
## FIELDING_E 1.386e-02 3.924e-03 3.533 0.000419 ***
## sq -2.177e-05 3.108e-06 -7.005 3.29e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.17 on 2171 degrees of freedom
## Multiple R-squared: 0.05253, Adjusted R-squared: 0.05165
## F-statistic: 60.18 on 2 and 2171 DF, p-value: < 2.2e-16
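The linear versus squared-term comparison above can be sketched on simulated data. The names `errors`, `wins`, and `sq` and the simulated coefficients are assumptions chosen to resemble the fitted model; this is not the original code.

```r
set.seed(3)
n <- 400
toy <- data.frame(errors = runif(n, min = 60, max = 1000))

# Simulated curved relationship between errors and wins.
toy$wins <- 80 + 0.014 * toy$errors - 2.2e-05 * toy$errors^2 + rnorm(n, sd = 5)
toy$sq <- toy$errors^2

fit_lin <- lm(wins ~ errors, data = toy)
fit_sq  <- lm(wins ~ errors + sq, data = toy)

# Adjusted R-squared for each model.
adj <- c(linear    = summary(fit_lin)$adj.r.squared,
         quadratic = summary(fit_sq)$adj.r.squared)
print(round(adj, 4))

# F-test for whether the squared term adds explanatory power.
cmp <- anova(fit_lin, fit_sq)
print(cmp)
```

Comparing nested models with `anova()` gives a formal test of the extra term, which complements the change in adjusted R-squared.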
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = dfTrain_ImputedMedian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.264 -8.466 0.163 8.273 58.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9560970 5.4876280 4.365 1.33e-05 ***
## INDEX -0.0004771 0.0003788 -1.259 0.207988
## BATTING_H 0.0482928 0.0037112 13.013 < 2e-16 ***
## BATTING_2B -0.0232530 0.0092311 -2.519 0.011841 *
## BATTING_3B 0.0595670 0.0169134 3.522 0.000437 ***
## BATTING_HR 0.0655424 0.0272468 2.406 0.016234 *
## BATTING_BB 0.0084691 0.0057882 1.463 0.143567
## BATTING_SO -0.0100510 0.0025721 -3.908 9.61e-05 ***
## BASERUN_SB 0.0254437 0.0044746 5.686 1.47e-08 ***
## BASERUN_CS 0.0006521 0.0161429 0.040 0.967780
## PITCHING_H -0.0009865 0.0003651 -2.702 0.006949 **
## PITCHING_HR 0.0116273 0.0240289 0.484 0.628514
## PITCHING_BB 0.0014808 0.0040999 0.361 0.718000
## PITCHING_SO 0.0028141 0.0009069 3.103 0.001941 **
## FIELDING_E -0.0186779 0.0024906 -7.499 9.31e-14 ***
## FIELDING_DP -0.1091373 0.0136377 -8.003 1.97e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.86 on 2158 degrees of freedom
## Multiple R-squared: 0.3226, Adjusted R-squared: 0.3179
## F-statistic: 68.5 on 15 and 2158 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B +
## BATTING_HR + BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_H +
## PITCHING_SO + FIELDING_E + FIELDING_DP, data = dfTrain_ImputedMedian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.153 -8.411 0.176 8.307 58.465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.6861348 5.2806294 4.296 1.82e-05 ***
## BATTING_H 0.0486089 0.0036841 13.194 < 2e-16 ***
## BATTING_2B -0.0233877 0.0092203 -2.537 0.011265 *
## BATTING_3B 0.0602198 0.0166990 3.606 0.000318 ***
## BATTING_HR 0.0770786 0.0097715 7.888 4.83e-15 ***
## BATTING_BB 0.0104799 0.0033563 3.122 0.001817 **
## BATTING_SO -0.0104007 0.0024834 -4.188 2.93e-05 ***
## BASERUN_SB 0.0253857 0.0042813 5.929 3.53e-09 ***
## PITCHING_H -0.0008928 0.0003178 -2.809 0.005008 **
## PITCHING_SO 0.0030690 0.0006625 4.633 3.82e-06 ***
## FIELDING_E -0.0184139 0.0024107 -7.639 3.28e-14 ***
## FIELDING_DP -0.1095211 0.0136173 -8.043 1.43e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.86 on 2162 degrees of freedom
## Multiple R-squared: 0.3218, Adjusted R-squared: 0.3184
## F-statistic: 93.27 on 11 and 2162 DF, p-value: < 2.2e-16
Understanding the role of double plays - remove the influence of hits:
## [1] -0.08895731
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_DP + PITCHING_H, data = dfTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.999 -9.102 0.739 10.013 43.146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.0829610 2.4592867 30.530 < 2e-16 ***
## FIELDING_DP -0.0045343 0.0121655 -0.373 0.709
## PITCHING_H 0.0041845 0.0008319 5.030 5.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.85 on 1987 degrees of freedom
## (286 observations deleted due to missingness)
## Multiple R-squared: 0.01377, Adjusted R-squared: 0.01278
## F-statistic: 13.87 on 2 and 1987 DF, p-value: 1.038e-06
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_DP + PITCHING_H, data = dfTrain_ImputedMedian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.237 -9.564 0.855 10.359 68.964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.1139149 2.2975487 37.916 < 2e-16 ***
## FIELDING_DP -0.0271240 0.0147930 -1.834 0.0669 .
## PITCHING_H -0.0012921 0.0002317 -5.576 2.76e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 2171 degrees of freedom
## Multiple R-squared: 0.01493, Adjusted R-squared: 0.01402
## F-statistic: 16.45 on 2 and 2171 DF, p-value: 8.127e-08
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_DP * PITCHING_H, data = dfTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.126 -9.261 1.004 9.713 47.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.023e+02 5.724e+00 17.872 < 2e-16 ***
## FIELDING_DP -2.549e-01 4.914e-02 -5.188 2.35e-07 ***
## PITCHING_H -1.244e-02 3.269e-03 -3.806 0.000145 ***
## FIELDING_DP:PITCHING_H 1.561e-04 2.970e-05 5.257 1.62e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.76 on 1986 degrees of freedom
## (286 observations deleted due to missingness)
## Multiple R-squared: 0.02731, Adjusted R-squared: 0.02584
## F-statistic: 18.59 on 3 and 1986 DF, p-value: 6.864e-12
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_DP * PITCHING_H, data = dfTrain_ImputedMedian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.162 -9.515 0.820 10.312 69.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.833e+01 5.757e+00 13.607 <2e-16 ***
## FIELDING_DP 3.302e-02 3.906e-02 0.845 0.3981
## PITCHING_H 3.513e-03 2.898e-03 1.212 0.2256
## FIELDING_DP:PITCHING_H -3.328e-05 2.001e-05 -1.663 0.0964 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 2170 degrees of freedom
## Multiple R-squared: 0.01618, Adjusted R-squared: 0.01482
## F-statistic: 11.9 on 3 and 2170 DF, p-value: 9.984e-08
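The interaction term makes a difference on the complete-case data (p = 1.62e-07, with R-squared rising from 0.01377 to 0.02731), but it is only marginally significant on the median-imputed data (p = 0.0964).
Taking a log of PITCHING_H: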
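The additive and interaction fits above can also be compared formally with a nested-model F test. The following is a minimal sketch on simulated data, not the report's data: the data frame `toy` and its values are assumptions made for illustration, and only the column names mirror the report.

```r
# Hypothetical toy data: column names mirror the report, values are simulated.
set.seed(1)
n <- 500
toy <- data.frame(
  FIELDING_DP = rnorm(n, mean = 146, sd = 26),
  PITCHING_H  = rnorm(n, mean = 1500, sd = 150)
)

# Simulate wins with a genuine interaction so the test has something to find.
toy$TARGET_WINS <- 80 -
  0.25 * (toy$FIELDING_DP - 146) -
  0.012 * (toy$PITCHING_H - 1500) +
  0.0015 * (toy$FIELDING_DP - 146) * (toy$PITCHING_H - 1500) +
  rnorm(n, sd = 10)

fit_add <- lm(TARGET_WINS ~ FIELDING_DP + PITCHING_H, data = toy)
fit_int <- lm(TARGET_WINS ~ FIELDING_DP * PITCHING_H, data = toy)

# Nested-model F test: a small p-value favours keeping the interaction term.
cmp <- anova(fit_add, fit_int)
print(cmp)
```

Because the two models are nested, the F test here is equivalent to the t test on the single interaction coefficient, but `anova()` extends naturally to comparisons that add several terms at once.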
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + logPitch_h, data = dfTrain_ImputedMedian5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.631 -9.694 1.045 10.242 64.174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.944e+01 9.013e-01 88.133 < 2e-16 ***
## PITCHING_H 1.126e-03 5.376e-04 2.094 0.0364 *
## logPitch_h -1.313e-07 2.682e-08 -4.897 1.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.39 on 2171 degrees of freedom
## Multiple R-squared: 0.02418, Adjusted R-squared: 0.02328
## F-statistic: 26.9 on 2 and 2171 DF, p-value: 2.895e-12
A closer look at PITCHING_H, taking out the outliers:
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H, data = dfTrain_ImputedMedian6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.864 -8.396 0.413 8.870 30.267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.26774 8.46728 0.976 0.329
## PITCHING_H 0.04990 0.00602 8.289 3.78e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 970 degrees of freedom
## Multiple R-squared: 0.06614, Adjusted R-squared: 0.06518
## F-statistic: 68.7 on 1 and 970 DF, p-value: 3.785e-16
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H, data = dfTrain_ImputedMedian7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.879 -13.887 2.392 15.885 65.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.487384 2.180193 41.50 < 2e-16 ***
## PITCHING_H -0.002207 0.000418 -5.28 2.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.87 on 255 degrees of freedom
## Multiple R-squared: 0.09855, Adjusted R-squared: 0.09502
## F-statistic: 27.88 on 1 and 255 DF, p-value: 2.767e-07
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H, data = dfTrain_ImputedMedian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.165 -9.462 0.897 10.651 68.914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.0150688 0.5308401 156.384 < 2e-16 ***
## PITCHING_H -0.0012543 0.0002309 -5.432 6.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.47 on 2172 degrees of freedom
## Multiple R-squared: 0.0134, Adjusted R-squared: 0.01295
## F-statistic: 29.5 on 1 and 2172 DF, p-value: 6.205e-08
Eliminating outliers has essentially no effect (the PITCHING_H coefficient moves from -0.0012543 to -0.0012530), but the outliers seem to be grouped (compare new outliers with old):
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H, data = dfTrain_ImputedMedian_nooutliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.170 -9.460 0.889 10.636 68.905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.0181857 0.5306250 156.45 < 2e-16 ***
## PITCHING_H -0.0012530 0.0002307 -5.43 6.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 2169 degrees of freedom
## Multiple R-squared: 0.01341, Adjusted R-squared: 0.01296
## F-statistic: 29.49 on 1 and 2169 DF, p-value: 6.263e-08
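The report does not say which rule was used to drop outliers. As one possible illustration, the sketch below filters on the 1.5 * IQR fences and refits the model; the rule, the data frame `toy`, and all values in it are assumptions and are not taken from the report.

```r
# Hypothetical toy data with three extreme PITCHING_H values appended.
set.seed(2)
toy <- data.frame(PITCHING_H = c(rnorm(300, mean = 1500, sd = 100),
                                 9000, 12000, 20000))
toy$TARGET_WINS <- 80 + rnorm(nrow(toy), sd = 12)

# Keep rows inside the 1.5 * IQR fences (an assumed rule, not the report's).
q     <- unname(quantile(toy$PITCHING_H, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
keep  <- toy$PITCHING_H >= q[1] - fence & toy$PITCHING_H <= q[2] + fence
toy_trim <- toy[keep, ]

fit_all  <- lm(TARGET_WINS ~ PITCHING_H, data = toy)
fit_trim <- lm(TARGET_WINS ~ PITCHING_H, data = toy_trim)

# Side-by-side coefficients show how much the flagged rows move the fit.
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))
```

Comparing the rows flagged before and after trimming is a quick way to see whether outliers cluster, since removing one group often exposes the next.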
Looking for interactions:
A similar analysis on the data with missing records:
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_HBP + PITCHING_HR +
## PITCHING_BB + PITCHING_SO + FIELDING_E + FIELDING_DP, data = dfTrain_flag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.95454 19.10292 3.191 0.001670 **
## BATTING_H 0.02541 0.01009 2.518 0.012648 *
## BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (2080 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16
The only interaction appears with FIELDING_E. However, interacting PITCHING_H with itself (a squared term) improves the R-squared from 0.0134 to 0.0222.
##
## Call:
## lm(formula = TARGET_WINS ~ Pitch_h_squared, data = dfTrain_ImputedMedian9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.015 -9.069 0.997 10.158 66.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.119e+01 3.359e-01 241.736 < 2e-16 ***
## Pitch_h_squared -8.054e-08 1.147e-08 -7.024 2.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 2172 degrees of freedom
## Multiple R-squared: 0.02221, Adjusted R-squared: 0.02176
## F-statistic: 49.33 on 1 and 2172 DF, p-value: 2.883e-12
##
## Call:
## lm(formula = TARGET_WINS ~ Pitch_h_log, data = dfTrain_ImputedMedian9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.408 -9.582 1.145 10.356 66.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.2807 7.9389 10.994 <2e-16 ***
## Pitch_h_log -0.8795 1.0706 -0.822 0.411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.57 on 2172 degrees of freedom
## Multiple R-squared: 0.0003106, Adjusted R-squared: -0.0001496
## F-statistic: 0.6749 on 1 and 2172 DF, p-value: 0.4114
##
## Call:
## lm(formula = TARGET_WINS ~ Pitch_h_sqrt, data = dfTrain_ImputedMedian9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.753 -9.477 0.982 10.732 68.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.48013 1.47144 58.09 < 2e-16 ***
## Pitch_h_sqrt -0.11429 0.03474 -3.29 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.54 on 2172 degrees of freedom
## Multiple R-squared: 0.00496, Adjusted R-squared: 0.004501
## F-statistic: 10.83 on 1 and 2172 DF, p-value: 0.001017
##
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H * Pitch_h_Under1500, data = dfTrain_ImputedMedian8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.864 -9.153 0.979 9.772 67.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.643e+01 6.550e-01 131.965 < 2e-16 ***
## PITCHING_H -1.771e-03 2.322e-04 -7.628 3.55e-14 ***
## Pitch_h_Under1500 -7.816e+01 1.047e+01 -7.466 1.19e-13 ***
## PITCHING_H:Pitch_h_Under1500 5.167e-02 7.432e-03 6.952 4.76e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.08 on 2170 degrees of freedom
## Multiple R-squared: 0.06361, Adjusted R-squared: 0.06232
## F-statistic: 49.14 on 3 and 2170 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_E * Pitch_h_Under1500, data = dfTrain_ImputedMedian9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.182 -9.571 0.598 9.826 73.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.867745 0.643380 136.572 < 2e-16 ***
## FIELDING_E -0.016158 0.001498 -10.787 < 2e-16 ***
## Pitch_h_Under1500 -0.776515 1.469068 -0.529 0.597
## FIELDING_E:Pitch_h_Under1500 -0.042078 0.008364 -5.031 5.28e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.88 on 2170 degrees of freedom
## Multiple R-squared: 0.08892, Adjusted R-squared: 0.08766
## F-statistic: 70.59 on 3 and 2170 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ FIELDING_E, data = dfTrain_ImputedMedian9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.638 -9.847 0.708 10.050 73.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.645750 0.476605 175.503 <2e-16 ***
## FIELDING_E -0.011815 0.001415 -8.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.33 on 2172 degrees of freedom
## Multiple R-squared: 0.03112, Adjusted R-squared: 0.03067
## F-statistic: 69.75 on 1 and 2172 DF, p-value: < 2.2e-16
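Interacting a predictor with a 0/1 indicator gives each group its own slope. In the PITCHING_H * Pitch_h_Under1500 output above, the slope for the flagged group is -0.001771 + 0.05167, roughly 0.0499, which agrees with the 0.04990 slope of the PITCHING_H-only model fitted on the 972-record subset (presumably the under-1500 records). The sketch below reproduces the idea on simulated data; it assumes that `Pitch_h_Under1500` is an indicator for PITCHING_H below 1500, which the report does not state explicitly, and `toy` is hypothetical.

```r
# Hypothetical toy data; the indicator definition is an assumption.
set.seed(3)
n <- 1000
toy <- data.frame(PITCHING_H = runif(n, min = 1100, max = 3000))
toy$Pitch_h_Under1500 <- as.integer(toy$PITCHING_H < 1500)

# Simulated wins: positive slope below 1500 hits allowed, mildly negative above.
toy$TARGET_WINS <- ifelse(toy$PITCHING_H < 1500,
                          10 + 0.05 * toy$PITCHING_H,
                          88 - 0.002 * toy$PITCHING_H) +
  rnorm(n, sd = 8)

fit_pw <- lm(TARGET_WINS ~ PITCHING_H * Pitch_h_Under1500, data = toy)
b <- coef(fit_pw)

# Slope where the indicator is 0, and where it is 1 (main effect + interaction).
slope_above <- unname(b["PITCHING_H"])
slope_below <- unname(b["PITCHING_H"] + b["PITCHING_H:Pitch_h_Under1500"])
print(c(above = slope_above, below = slope_below))
```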
Final models:
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B +
## BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_H + PITCHING_BB +
## FIELDING_E + FIELDING_DP + Missing_Flag + Pitch_h_Under1500 +
## inter_H_Itself + Inter_H_Err + BB_sq + BHR_sq + BSO_sq +
## PH_sq + Inter_E_Cohort + Inter_bhr_Cohort + Inter_bbb_Cohort +
## Inter_bs_Cohort, data = dfTrain_Final2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.224 -7.883 0.383 7.828 58.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.862e+01 7.211e+00 6.743 1.96e-11 ***
## BATTING_H 5.651e-02 3.809e-03 14.836 < 2e-16 ***
## BATTING_2B -1.332e-02 8.793e-03 -1.515 0.129953
## BATTING_3B 8.266e-02 1.590e-02 5.199 2.18e-07 ***
## BATTING_BB -2.197e-01 1.816e-02 -12.094 < 2e-16 ***
## BATTING_SO 4.104e-02 7.230e-03 5.676 1.55e-08 ***
## BASERUN_SB 3.979e-02 4.284e-03 9.288 < 2e-16 ***
## PITCHING_H -3.941e-03 1.044e-03 -3.776 0.000164 ***
## PITCHING_BB 1.909e-02 3.295e-03 5.793 7.87e-09 ***
## FIELDING_E -3.595e-02 2.968e-03 -12.115 < 2e-16 ***
## FIELDING_DP -8.629e-02 1.282e-02 -6.729 2.16e-11 ***
## Missing_Flag 2.730e+01 1.156e+01 2.362 0.018238 *
## Pitch_h_Under1500 3.344e+01 9.425e+00 3.548 0.000396 ***
## inter_H_Itself -1.719e-02 6.503e-03 -2.643 0.008276 **
## Inter_H_Err -3.658e-02 7.000e-03 -5.226 1.90e-07 ***
## BB_sq -2.041e-04 1.589e-05 -12.848 < 2e-16 ***
## BHR_sq -2.731e-04 3.515e-05 -7.767 1.21e-14 ***
## BSO_sq 3.267e-05 4.696e-06 6.957 4.55e-12 ***
## PH_sq -8.681e-08 3.439e-08 -2.524 0.011660 *
## Inter_E_Cohort -1.997e-01 2.698e-02 -7.401 1.89e-13 ***
## Inter_bhr_Cohort 3.348e-01 1.599e-01 2.094 0.036350 *
## Inter_bbb_Cohort 4.961e-02 1.895e-02 2.618 0.008908 **
## Inter_bs_Cohort 4.692e-02 2.717e-02 1.727 0.084256 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.22 on 2253 degrees of freedom
## Multiple R-squared: 0.4041, Adjusted R-squared: 0.3983
## F-statistic: 69.44 on 22 and 2253 DF, p-value: < 2.2e-16
## [1] 0.3183679
## [1] 0.3776131
## [1] 0.398272
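The three values printed above appear to be adjusted R-squared figures: the first matches the 0.3184 of the first model in this section and the last matches the 0.3983 of the final model. A hedged sketch of how such a comparison can be collected in one step is shown below; the models and the data frame `toy` are made up for illustration and are not the report's models.

```r
# Hypothetical toy data and three nested models of increasing complexity.
set.seed(4)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2^2 + rnorm(n)

fits <- list(
  base    = lm(y ~ x1, data = toy),
  plus_x2 = lm(y ~ x1 + x2, data = toy),
  plus_sq = lm(y ~ x1 + x2 + I(x2^2), data = toy)
)

# Adjusted R-squared penalises extra terms, so it is comparable across models.
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(adj_r2)
```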
Checking interactions with the missing-values cohort:
Looking for interactions:
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_BB, data = dfTrain_ImputedMean_NoCohort1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.936 -9.554 0.579 9.674 78.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.935670 1.370076 48.13 <2e-16 ***
## BATTING_BB 0.029358 0.002635 11.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.15 on 2172 degrees of freedom
## Multiple R-squared: 0.05406, Adjusted R-squared: 0.05362
## F-statistic: 124.1 on 1 and 2172 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_BB + BB_sq, data = dfTrain_ImputedMean_NoCohort1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.421 -9.315 0.582 9.742 72.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.442e+01 2.462e+00 30.229 < 2e-16 ***
## BATTING_BB -1.398e-02 1.079e-02 -1.296 0.195
## BB_sq -4.958e-05 1.197e-05 -4.142 3.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.09 on 2171 degrees of freedom
## Multiple R-squared: 0.06147, Adjusted R-squared: 0.06061
## F-statistic: 71.1 on 2 and 2171 DF, p-value: < 2.2e-16
PITCHING_SO has 20 zeroes, which look like missing values. We also eliminate the record with 0 wins (INDEX 1347).
## INDEX TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB
## 1 325 120 2270 301 132 42 74
## 2 326 146 2305 322 111 29 64
## 3 435 65 1464 147 32 3 94
## 4 459 23 1458 220 35 0 93
## 5 952 77 1895 244 8 8 93
## 6 953 73 1685 206 31 0 58
## 7 1106 49 1794 281 58 6 79
## 8 1107 107 1725 194 67 4 79
## 9 1347 0 891 135 0 0 0
## 10 1498 24 1289 145 41 7 45
## 11 1502 105 1767 249 77 20 95
## 12 1503 71 1491 200 57 17 50
## 13 2037 97 1903 256 50 18 71
## 14 2038 118 2086 280 135 22 89
## 15 2048 81 1927 207 142 8 78
## 16 2049 88 1622 155 67 12 52
## 17 2253 34 1177 171 9 0 119
## 18 2254 93 1527 200 64 0 79
## 19 2486 12 1009 112 75 0 12
## 20 2493 29 1122 69 64 0 29
## BATTING_SO BASERUN_SB BASERUN_CS PITCHING_H PITCHING_HR PITCHING_BB
## 1 0 124.7618 52.80386 5253 97 171
## 2 0 124.7618 52.80386 4727 59 131
## 3 0 124.7618 52.80386 4312 9 277
## 4 0 124.7618 52.80386 16871 0 1076
## 5 0 124.7618 52.80386 5203 22 255
## 6 0 124.7618 52.80386 4074 0 140
## 7 0 124.7618 52.80386 5484 18 241
## 8 0 124.7618 52.80386 3408 8 156
## 9 0 0.0000 0.00000 24057 0 0
## 10 0 124.7618 52.80386 4443 24 155
## 11 0 124.7618 52.80386 4404 50 237
## 12 0 124.7618 52.80386 3552 41 119
## 13 0 124.7618 52.80386 5605 53 209
## 14 0 124.7618 52.80386 4629 49 198
## 15 0 124.7618 52.80386 5382 22 218
## 16 0 124.7618 52.80386 3864 29 124
## 17 0 124.7618 52.80386 10035 0 1015
## 18 0 124.7618 52.80386 3638 0 188
## 19 0 124.7618 52.80386 12574 0 150
## 20 0 124.7618 52.80386 6492 0 168
## PITCHING_SO FIELDING_E FIELDING_DP
## 1 0 1058 146.3879
## 2 0 951 146.3879
## 3 0 1473 146.3879
## 4 0 1898 146.3879
## 5 0 1225 146.3879
## 6 0 931 146.3879
## 7 0 1531 146.3879
## 8 0 853 146.3879
## 9 0 1890 146.3879
## 10 0 1506 146.3879
## 11 0 1092 146.3879
## 12 0 1253 146.3879
## 13 0 1166 146.3879
## 14 0 928 146.3879
## 15 0 1447 146.3879
## 16 0 1132 146.3879
## 17 0 1279 146.3879
## 18 0 1010 146.3879
## 19 0 847 146.3879
## 20 0 1522 146.3879
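One way to act on the listing above is to recode the zero strikeout counts as missing and drop the zero-win record before any imputation. The sketch below is a minimal illustration under that assumption: the first two rows copy INDEX, TARGET_WINS and PITCHING_SO from the listing, while the third row is invented so the example has one ordinary record.

```r
# Two rows taken from the listing above plus one made-up ordinary record.
toy <- data.frame(
  INDEX       = c(325, 1347, 2000),  # INDEX 2000 is hypothetical
  TARGET_WINS = c(120, 0, 85),
  PITCHING_SO = c(0, 0, 950)
)

# Drop the record with zero wins.
toy_clean <- toy[toy$TARGET_WINS > 0, ]

# Treat a zero strikeout total as a missing value rather than a real count.
toy_clean$PITCHING_SO[toy_clean$PITCHING_SO == 0] <- NA

print(toy_clean)
```

Recoding to `NA` first means the zeroes are handled by whatever imputation step follows, instead of being treated as real observations.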